MSc in Data Analytics

Individual Project CA1

Author:

Student Name | Student Number | Email address
Vitor Notaro | sba20229 | vitornotaro34@gmail.com

Lecturer:

Lecturer Name | Module | Email address
David McQuaid | Data Preparation & Visualisation for Data Analytics | dmcquaid@cct.ie
David McQuaid | Programming for Data Analytics | dmcquaid@cct.ie
Dr. Muhammad Iqbal | Machine Learning for Data Analytics | miqbal@cct.ie
Marina Iantorno | Statistics for Data Analytics | miantorno@cct.ie

Table of contents:

  1. CA1 Information
  2. Initial Exploratory Data Analysis
  3. Statistics Questions
  4. Folium for Visualization (Dublin Bike Map)
  5. Data Preparation and Feature Engineering
  6. Re-Shaping Data and Building Machine Learning Models
  7. Dimensionality Reduction using Principal Component Analysis
  8. Hierarchical Clustering
  9. KMeans Clustering
  10. Comparing models with silhouette score
  11. Assigning KMeans Labels and evaluating results with Data Visualization
  12. Conclusions
  13. References

Initial Exploratory Data Analysis

Central Tendency Measures

These measures summarize the centre of the data distribution in a single number.

Variation Measures

These measures help us to get values that determine the level of homogeneity within the observations.

In other words, we can see how different/similar the values are.
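As a concrete illustration of both groups of measures, pandas computes them directly. The sample values below are hypothetical; the real values would come from the Dublin Bikes dataset:

```python
import pandas as pd

# Hypothetical sample of AVAILABLE_BIKES values, for illustration only
bikes = pd.Series([5, 8, 8, 12, 20, 3, 15, 8, 10, 6])

# Central tendency: single numbers reflecting the centre of the distribution
print("mean:  ", bikes.mean())      # 9.5
print("median:", bikes.median())    # 8.0
print("mode:  ", bikes.mode()[0])   # 8

# Variation: how homogeneous/similar the observations are
print("range:   ", bikes.max() - bikes.min())
print("variance:", bikes.var())
print("std dev: ", bikes.std())
```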

Dataset Insights

We have no Missing Data in any feature

STATUS feature has constant value "Open" (does not add value to analysis)

STATION ID is a unique identifier of each Station (does not add value to analysis)

ADDRESS has a high cardinality: 109 distinct values (does not add value to analysis once we have Lat and Long)

AVAILABLE BIKES is the only feature that presents outliers (around 1% of the values). However, none of these values exceeds 40, which is the maximum possible, since no bike station has more than 40 stands.

There was no evidence of errors, since in no observation was the number of bicycles available at a station greater than its total number of stands. For that reason I am going to consider the outliers acceptable and part of the nature of the business operation.

AVAILABLE BIKE STANDS is skewed but close to Normal Distribution.
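The outlier check mentioned above can be sketched with the common IQR (interquartile range) rule; the sample values here are hypothetical:

```python
import pandas as pd

# Hypothetical AVAILABLE_BIKES sample; the real data would come from the CSV
bikes = pd.Series([0, 2, 5, 8, 8, 10, 12, 15, 20, 38])

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
q1, q3 = bikes.quantile(0.25), bikes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = bikes[(bikes < lower) | (bikes > upper)]
print(f"{len(outliers)} outlier(s):", outliers.tolist())
```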

Statistics Questions

Since descriptive analysis helps me understand the dataset, I can also use those insights to calculate probabilities.

A probability distribution is a statistical function that is used to show all the possible values and likelihoods of a random variable in a specific range.

The binomial probability distribution is useful to answer questions such as:

My analysis shows that 27 stations, i.e. 25% of the Dublin Bike Stations, have 30 stands. If we randomly choose 5 stations:

What is the probability of finding exactly 3 stations with 30 stands?

As this is a random variable, we need a definition.

The structure of this variable is:

X = number of elements with a characteristic/attribute (within a limit)

X = number of Stations with 30 Stands (within 5 Stations)

n = 5
p = 0.25
q = 0.75

Formula:

P(X = k) = C(n, k) * p^k * q^(n - k)

Using Python we can answer this question: the answer is about 9%.
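A sketch of the calculation with scipy, using the values defined above:

```python
from math import comb

from scipy.stats import binom

# P(X = 3) for n = 5 stations, p = 0.25 chance a station has 30 stands
n, p, k = 5, 0.25, 3
prob = binom.pmf(k, n, p)

# The same result from the formula directly: C(n, k) * p^k * q^(n-k)
manual = comb(n, k) * p**k * (1 - p) ** (n - k)

print(f"P(X = 3) = {prob:.4f}")  # about 0.0879, i.e. roughly 9%
```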

We can also use the descriptive analyses to calculate probability in a variable normally distributed.

The Normal distribution is continuous, so probabilities accumulate over ranges: we cannot calculate the probability of getting an exact value; it will always be greater than or less than, never equal to.

The Normal distribution is always symmetric, which means that the curve will never be skewed to one side, and the expected value (average) is always in the middle of the bell curve.

The Normal probability distribution is useful to answer questions such as:

Assume that the number of AVAILABLE BIKE STANDS across Dublin Bike Stations is normally distributed with an average of 20 and a standard deviation of 10.

What is the probability of one station having 15 or more stands available?

Formula:

P(X >= 15) = 1 - Φ((15 - 20) / 10) = 1 - Φ(-0.5)

Using Python we can answer this question: the answer is about 69%.
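A sketch of the normal-distribution calculation with scipy, using the mean and standard deviation assumed above:

```python
from scipy.stats import norm

# P(X >= 15) for X ~ Normal(mean = 20, sd = 10)
mu, sigma = 20, 10
prob = norm.sf(15, loc=mu, scale=sigma)  # survival function: 1 - CDF
print(f"P(X >= 15) = {prob:.4f}")  # about 0.6915, i.e. roughly 69%
```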

Using Longitude and Latitude to visualize Dublin Bike stations on a Map with Folium

After visualizing all stations on a map and reflecting on the data exploration steps made so far, it is clear that a good approach would be to use this dataset for Unsupervised Learning.

Unsupervised learning is a type of Machine Learning algorithm that learns patterns from untagged data. The goal of the algorithm is to find relationships within the data and group data points based on the input data.

Using unsupervised ML we could address questions such as the below:

Are there differences or similarities between Dublin Bike Stations?

Is it possible to cluster the stations based on mean usage?

How many groups would we have?

Now that I have defined a target, let's get started with data preparation and feature engineering.

Data Preparation and Feature engineering

As I have two similar columns referring to times, both carrying similar information with small differences of around 3 or 4 minutes, I am going to choose LAST_UPDATED as the main Date Time feature because it carries the most accurate picture of the exact date and time.

However, previous steps showed this column is stored as an object type, so I am going to convert it to a datetime type.

Extracting features from the Date Time column can be useful, since usage may differ substantially on weekends.

It is likely that I am searching for days which could be considered outliers because they are distant from the average usage.

To make this analysis I am going to create two new columns: DAY_NUMBER (starting with 0 = Monday) and DAY_TYPE (Weekday or Weekend, i.e. Saturday/Sunday).
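A minimal sketch of these two steps; the column names follow the text, while the sample timestamps are hypothetical:

```python
import pandas as pd

# Hypothetical frame with LAST_UPDATED stored as strings (object dtype)
df = pd.DataFrame({
    "LAST_UPDATED": ["2020-02-01 08:00:00",    # a Saturday
                     "2020-02-03 08:00:00"]})  # a Monday

# Convert the object column to a proper datetime type
df["LAST_UPDATED"] = pd.to_datetime(df["LAST_UPDATED"])

# DAY_NUMBER: 0 = Monday ... 6 = Sunday
df["DAY_NUMBER"] = df["LAST_UPDATED"].dt.dayofweek

# DAY_TYPE: Weekday vs Weekend (Saturday/Sunday)
df["DAY_TYPE"] = df["DAY_NUMBER"].map(lambda d: "Weekend" if d >= 5 else "Weekday")
print(df[["DAY_NUMBER", "DAY_TYPE"]])
```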

As the update times differ across observations, it is going to be important to create a precise, standard time in order to be able to compare the stations.

This should also improve the distribution, since the earlier EDA confirmed that the LAST_UPDATED column has a very high cardinality: 1,311,282 distinct values.

To do that I am going to create a new feature called TIME_ROUNDED_10_MIN using LAST_UPDATED as a reference; after that I am going to extract only the rounded time from this new feature and call it NEW_TIME.

Because each station has a different number of bike stands, I am going to create a new variable called PERCENTAGE_OCCUPANCY to enable a comparison of each station's situation across the timeline.

This puts all stations on the same footing so they can be compared fairly.
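A sketch of the rounding and normalisation steps; the column names follow the text, while the sample rows are hypothetical:

```python
import pandas as pd

# Hypothetical observations; column names follow the ones used in the text
df = pd.DataFrame({
    "LAST_UPDATED": pd.to_datetime(["2020-02-03 08:03:21",
                                    "2020-02-03 08:07:45"]),
    "BIKE_STANDS": [40, 20],
    "AVAILABLE_BIKES": [10, 15],
})

# Round timestamps to the nearest 10 minutes, then keep only the time part
df["TIME_ROUNDED_10_MIN"] = df["LAST_UPDATED"].dt.round("10min")
df["NEW_TIME"] = df["TIME_ROUNDED_10_MIN"].dt.time

# Normalise usage by station size so stations of different sizes are comparable
df["PERCENTAGE_OCCUPANCY"] = df["AVAILABLE_BIKES"] / df["BIKE_STANDS"]
print(df[["NEW_TIME", "PERCENTAGE_OCCUPANCY"]])
```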

Data cleansing is also an important step: as I have several redundant columns that are not useful for this exercise, I am going to drop them.

Re-Shaping Data and Building Machine Learning Models

At this stage it is important to mention that, with my dataset in its current format, the observations are the specific events that took place at each station in a very specific time frame.

To perform some analyses it is important to reshape my data, adjusting the observations to what I am interested in analyzing.

For the first exercise I am going to build a Hierarchical Clustering model to check which days of the week are similar to each other and see whether clusters among the different days of the week emerge.

To do that I am going to pivot my data frame, setting DAY_NUMBER as the observations, my NEW_TIME column as the attributes, and populating the attribute values with the mean of PERCENTAGE_OCCUPANCY.
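The pivot described above can be sketched as follows, on a tiny hypothetical frame:

```python
import pandas as pd

# Hypothetical long-format data: one row per (day, time) event
df = pd.DataFrame({
    "DAY_NUMBER": [0, 0, 1, 1],
    "NEW_TIME": ["08:00", "08:10", "08:00", "08:10"],
    "PERCENTAGE_OCCUPANCY": [0.2, 0.4, 0.6, 0.8],
})

# Days become the observations (rows), times become the attributes (columns)
pivot = df.pivot_table(index="DAY_NUMBER",
                       columns="NEW_TIME",
                       values="PERCENTAGE_OCCUPANCY",
                       aggfunc="mean")
print(pivot)
```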

A dendrogram is used to represent the relationship between objects.

It is used to display the distance between each pair of sequentially merged objects in a feature space.

Dendrograms are commonly used to study hierarchical clusters before deciding the appropriate number of clusters for a dataset.
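A sketch of building such a dendrogram with scipy, on synthetic data shaped like the 7-day pivot (no_plot=True returns the tree structure without drawing it):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Synthetic pivoted matrix: 7 days x 4 time-of-day mean occupancies
rng = np.random.default_rng(42)
weekday = rng.uniform(0.4, 0.5, size=(5, 4))  # similar weekday profiles
weekend = rng.uniform(0.8, 0.9, size=(2, 4))  # distinct weekend profiles
X = np.vstack([weekday, weekend])

# Ward linkage merges the closest clusters step by step
Z = linkage(X, method="ward")

# no_plot=True returns the tree structure instead of plotting it
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf order: the weekend rows (5, 6) should sit together
```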

The above plot confirms that the weekends (Saturday = 5 and Sunday = 6) have similar patterns.

We can also see that Tuesdays (1) and Thursdays (3) are likely similar.

Wednesdays (2), Fridays (4) and Mondays (0) are also similar.

However, one common approach is to analyze the dendrogram and look for groups that combine at a higher dendrogram distance.

The grey dashed line crosses three groups and suggests a distance of 0.01 (from 0.05 to 0.06).

The red dashed line crosses two groups and suggests a distance greater than 0.09 (from 0.07 to 0.16).

For this reason, two clusters (weekends and workdays) is the most appropriate number.

An easy way to visualize the trend differences is by plotting a Line chart.

Using line charts to represent time series is generally accepted practice; the dots are frequently omitted altogether.

The above chart confirms that weekends have trends and patterns completely different from workdays.

For that reason I am treating those two days as outliers compared with the workdays, and I am going to exclude them from my next exercise by creating a new data frame with DAY_TYPE = Weekday.

As previously explained, the current dataset observations are the specific events that took place in each station in a very specific time frame.

To visualize the mean trend of each station across the time frame, I am going to adjust my observations using a pivot table, setting NEW_TIME as the observations, the STATIONS column as the attributes, and populating the attribute values with the mean PERCENTAGE_OCCUPANCY.

I am conscious this visualization will be very poor and extremely difficult to interpret, but that is exactly the point I am trying to prove.

As we have 109 different stations, it is impossible for our human eyes to find trends and patterns just by looking at that line chart.

As previously explained, I need a better way to cluster those stations, so I am going to use Unsupervised Learning models to complete this task.

To perform this exercise I am going to reshape my observations using a pivot table, setting NAME as the observations (because the stations are the objects I am interested in clustering), my NEW_TIME column as the attributes, and populating the attribute values with the mean PERCENTAGE_OCCUPANCY.

As expected, I ended up with 109 observations (my 109 stations) and 144 columns.

At this stage I decided to test if dimensionality reduction would be applicable for my new data set without losing its properties.

The questions to be addressed are:

Is it possible to perform dimensionality reduction on this dataset without losing its properties?

How many components would I need to explain at least 90% of the variance of my station dataset?

To answer those questions I decided to apply PCA (Principal Component Analysis).

Principal component analysis (PCA) is an unsupervised learning method.

Dimensionality Reduction using Principal Component Analysis

We can see that the first component explains 68% of the variance and the second 23%; together they explain roughly 92%.

I am going to create a new dataset called df2 containing the two principal components for each observation.
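A sketch of the PCA step with scikit-learn; the matrix here is random, so its explained variance ratios will not reproduce the 68%/23% observed on the real station data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the 109-station x 144-time occupancy matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(109, 144))

# Project onto the two principal components, one row per station
pca = PCA(n_components=2)
components = pca.fit_transform(X)

print("shape:", components.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```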

Let's plot our principal components to see if it's possible to identify the clusters.

Although it was possible to perform dimensionality reduction, I am still not able to clearly identify the clusters.

For that reason I am going to use unsupervised Machine Learning models to complete this task.

For this next exercise I am going to try two algorithms (Agglomerative Hierarchical Clustering and KMeans), measuring and comparing the results using the Silhouette score.

Hierarchical Clustering

The above plot confirmed some stations have similar patterns.

As previously mentioned, one common approach is to analyze the dendrogram and look for groups that combine at a higher dendrogram distance.

The grey dashed line crosses three groups and suggests a distance of around 19 (from 56 to 75) where it touches the red line.

The red dashed line crosses two groups and suggests a distance of at least 30 (from 75 to 110).

For this reason, two is the most appropriate number of clusters.

I am going to create a new data frame assigning the Hierarchical Cluster labels to each observation.
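A sketch of assigning hierarchical (Ward) cluster labels with scikit-learn, using synthetic data with two obvious groups in place of the real station matrix:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic station x time matrix with two clearly separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.uniform(0.2, 0.3, size=(50, 10)),
               rng.uniform(0.7, 0.8, size=(59, 10))])

# Ward linkage with 2 clusters, matching the dendrogram suggestion
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)

# Attach the cluster label of each observation to a data frame
df2 = pd.DataFrame(X)
df2["HIERARCHY_LABEL"] = labels
print(df2["HIERARCHY_LABEL"].value_counts())
```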

I am going to plot my new data frame in order to visualize the groups.

KMeans Clustering

I am going to attempt KMeans to see what results I can get from this model; KMeans also works with Euclidean distances.

However, unlike hierarchical clustering, the KMeans algorithm requires the number of clusters to be specified.

For that reason I am going to use the Elbow method to get insight into the possible number of clusters for the parameter K.
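The Elbow method can be sketched like this: fit KMeans over a range of K and record the inertia (within-cluster sum of squares); the "elbow" is where the curve flattens. The data here are synthetic blobs, not the real station matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with four well-separated blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2))
               for c in [(0, 0), (3, 0), (0, 3), (3, 3)]])

# Inertia for each candidate K; plot these to see the elbow
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
```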

From the above plot I can see the line starts to flatten around 4, but it could also be 5 or 6.

For this reason I am going to test the three candidate values for K and measure the Silhouette score for each number of clusters.

Comparing models with silhouette score

The Silhouette Coefficient, also known as the silhouette score, is a metric used to evaluate the quality of a clustering technique.

Its value ranges from -1 to 1.

1: Means clusters are well apart from each other and clearly distinguished.

0: Means clusters are indifferent, or we can say that the distance between clusters is not significant.

-1: Means clusters are assigned in the wrong way.
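A sketch of the comparison on synthetic data with three well-separated groups, where K = 3 should score highest:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2))
               for c in [(0, 0), (4, 0), (0, 4)]])

# Score each candidate K; a higher silhouette means better-separated clusters
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K = {k}: silhouette = {scores[k]:.3f}")
```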

The Silhouette score test suggests KMeans with 5 clusters as the best option, with a higher score than hierarchical clustering with only 2 clusters.

I am going to run the model with 5 clusters and plot the results to see how they look.

As previously mentioned, KMeans with 5 clusters has a better score than hierarchical clustering with only 2 clusters.

For that reason I am going to proceed to the next steps using only KMeans.

Assigning KMeans Labels and evaluating results with Data Visualization

I am going to check how many Stations each Cluster has.
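A sketch of the per-cluster count, assuming a hypothetical label series for the 109 stations:

```python
import pandas as pd

# Hypothetical KMeans labels for the 109 stations (5 clusters)
labels = pd.Series([0] * 30 + [1] * 25 + [2] * 20 + [3] * 19 + [4] * 15,
                   name="CLUSTER")

# Number of stations in each cluster
counts = labels.value_counts().sort_index()
print(counts)
```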

Following the same principle, in order to visualize the mean trend of each cluster across the time frame, I am going to adjust my observations using a pivot table, setting NEW_TIME as the observations, the KMeans cluster labels as the attributes, and populating the attribute values with the mean PERCENTAGE_OCCUPANCY.

This time I hope the visualization will be better and easier to interpret.

As we have only 5 different groups, our human eyes will be able to find trends and patterns by analyzing the Line Chart.

Conclusions

Upon review, I found that the CRISP-DM framework is an excellent tool to keep focus on the tasks at hand, and I will seek to make use of it in future projects.

Prior to signing off on this project, reassessments were made to ensure that all proposals and objectives initially addressed were met and that all questions were answered appropriately, with sufficient evidence and rationale for the decisions made.

It is evident that Data Science applied in any dataset can help us to better understand the world and make better decisions as human beings, with a mindset of recognising the impacts for future generations.

Some points to highlight regarding this assessment are the importance of:

• having a good methodology;
• having the right tools available;
• discipline for reading, researching, and developing skills and the necessary knowledge in Programming, Statistics, Data Preparation, Machine Learning and Data Visualization.

With the project findings presented, it is clear that further research could be done on the topic. There are clearly some correlations between the different groups of Dublin Bike Stations, and perhaps opportunities for a better distribution of the bikes among the stations to improve the service provided. This might become a topic for future discussions.

References:

Agresti, A. and Kateri, M. (2021). Foundations of Statistics for Data Scientists: with R and Python. First ed. Boca Raton: CRC Press.

Bhardwaj, A. (2020). Silhouette Coefficient : Validating clustering techniques. [online] Medium. Available at: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c

Blackmist (n.d.). Train and deploy a reinforcement learning model (preview) - Azure Machine Learning. [online] docs.microsoft.com. Available at: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-reinforcement-learning

Breslin, R. (2020). What Dublin Bikes data can tell us about the city and its people. [online] Medium. Available at: https://towardsdatascience.com/what-dublin-bikes-data-can-tell-us-about-the-city-and-its-people-63fde77ee383

Wilke, C.O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. Sebastopol, CA: O'Reilly Media.

Chen, C.-H. et al. (2016). Handbook of Data Visualization. Berlin: Springer.

Connors, L. (2021). Creating a Simple Map with Folium and Python. [online] Medium. Available at: https://towardsdatascience.com/creating-a-simple-map-with-folium-and-python-4c083abfff94

Das, A. (2020). Hierarchical Clustering in Python using Dendrogram and Cophenetic Correlation. [online] Medium. Available at: https://towardsdatascience.com/hierarchical-clustering-in-python-using-dendrogram-and-cophenetic-correlation-8d41a08f7eab

data.smartdublin.ie. (n.d.). Dublinbikes DCC - data.smartdublin.ie. [online] Available at: https://data.smartdublin.ie/dataset/analyze/33ec9fe2-4957-4e9a-ab55-c5e917c7a9ab

dataprep.ai. (n.d.). DataPrep — The easiest way to prepare data in Python. [online] Available at: https://dataprep.ai/

fontawesome.com. (n.d.). Font Awesome. [online] Available at: https://fontawesome.com/icons?d=gallery

Galarnyk, M. (2017). PCA using Python (scikit-learn). [online] Medium. Available at: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

GeeksforGeeks. (2021). What is Data Visualization and Why is It Important? [online] Available at: https://www.geeksforgeeks.org/what-is-data-visualization-and-why-is-it-important/#:~:text=Data%20visualization%20is%20very%20critical%20to%20market%20research

Grus, J. (2021). Data Science from Scratch: First Principles with Python. Second ed. O'Reilly.

IBM Cloud Education (2020a). What is Exploratory Data Analysis? [online] www.ibm.com. Available at: https://www.ibm.com/cloud/learn/exploratory-data-analysis.

IBM Cloud Education (2020b). What is Machine Learning? [online] www.ibm.com. Available at: https://www.ibm.com/cloud/learn/machine-learning

James (2017). Usage patterns of Dublin Bikes stations. [online] Medium. Available at: https://towardsdatascience.com/usage-patterns-of-dublin-bikes-stations-484bdd9c5b9e

Jeffares, A. (2019). How I used Machine Learning to improve my Dublin Bikes transit. [online] Medium. Available at: https://towardsdatascience.com/how-i-used-machine-learning-to-improve-my-dublin-bikes-transit-b6bdc7c2b5cb

Jeffares, A. (2021a). Data Science Nanodegree. [online] GitHub. Available at: https://github.com/alanjeffares/data-science-nanodegree/blob/master/dublin-bikes-analysis/get_nearest_available_bike.py

Jeffares, A. (2021b). Data Science Nanodegree. [online] GitHub. Available at: https://github.com/alanjeffares/data-science-nanodegree/blob/master/dublin-bikes-analysis/data_load_processing_viz.ipynb

Lawlor, J. (2021). dublin-bikes-timeseries-analysis. [online] GitHub. Available at: https://github.com/jameslawlor/dublin-bikes-timeseries-analysis

Martinez, J.C. (2021). How to plot your data on maps using Python and Folium. [online] livecodestream.dev. Available at: https://livecodestream.dev/post/how-to-plot-your-data-on-maps-using-python-and-folium/

McKinney, W. (2018). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. Second ed. Sebastopol, CA: O'Reilly Media.

Müller, A.C. and Guido, S. (2017). Introduction to Machine Learning with Python: A Guide for Data Scientists. Beijing: O'Reilly.

Nichani, P. (2020). OutLiers in Machine Learning. [online] Analytics Vidhya. Available at: https://medium.com/analytics-vidhya/outliers-in-machine-learning-e830b2bd8660#:~:text=Outlier%20is%20an%20observation%20that%20appears%20far%20away

plotlygraphs (2019). Line Charts. [online] plotly.com. Available at: https://plotly.com/python/line-charts/

rachelbreslin (2022). dublin_bikes/Dublin Bikes Analysis.ipynb at main · rachelbreslin/dublin_bikes. [online] GitHub. Available at: https://github.com/rachelbreslin/dublin_bikes/blob/main/Dublin%20Bikes%20Analysis.ipynb

Ranjan, A. (2020). Hierarchical Clustering (Agglomerative). [online] Analytics Vidhya. Available at: https://medium.com/analytics-vidhya/hierarchical-clustering-agglomerative-f6906d440981

scikit-learn.org (n.d.). sklearn.metrics.adjusted_rand_score — scikit-learn 0.23.1 documentation. [online] scikit-learn.org. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html

Scikit-learn.org. (2019). sklearn.preprocessing.StandardScaler — scikit-learn 0.21.2 documentation. [online] Available at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Summerfield, M. (2010). Programming in Python 3 : a complete introduction to the Python language. Second ed. Upper Saddle River, New Jersey: Addison-Wesley.

Weiss, N.A. (2017). Introductory Statistics. 10th ed. Pearson Education.

Wilke, C.O. (n.d.). Fundamentals of Data Visualization. [online] clauswilke.com. Available at: https://clauswilke.com/dataviz/directory-of-visualizations.html.

www.dublinbikes.ie. (n.d.). DublinBikes. [online] Available at: https://www.dublinbikes.ie/

www.ibm.com. (n.d.). CRISP-DM Help Overview. [online] Available at: https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=dm-crisp-help-overview.

www.youtube.com. (2018). PyData Dublin: Usage patterns of Dublin Bikes stations - James Lawlor. [online] Available at: https://www.youtube.com/watch?v=59ck_Z75cEY